Stratified Sampling for Extreme Multi-label Data

Authors

Abstract

Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet, there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided test-train splits that 1) aren’t always representative of the entire dataset, and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multi-class settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.
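The abstract does not describe the paper's algorithm, so the following is only a minimal sketch of one generic way to stratify multi-label data, not the authors' method. It assumes a greedy, rarest-label-first assignment; the function names `greedy_stratified_split` and `missing_label_fraction` and the `test_fraction` parameter are hypothetical and chosen for illustration. The second helper measures the problem the abstract mentions, namely labels that never appear in a given split.

```python
import random
from collections import defaultdict


def greedy_stratified_split(label_sets, test_fraction=0.2, seed=0):
    """Split sample indices into (train, test), handling rare labels first.

    label_sets: list of sets; label_sets[i] holds the labels of sample i.
    This is an illustrative heuristic, not the algorithm from the paper.
    """
    rng = random.Random(seed)
    samples_by_label = defaultdict(list)
    for idx, labels in enumerate(label_sets):
        for label in labels:
            samples_by_label[label].append(idx)

    assigned = {}  # sample index -> "train" or "test"
    # Visit labels from rarest to most common so rare labels get placed
    # before their few samples are claimed by frequent labels.
    ordered = sorted(
        samples_by_label,
        key=lambda l: (len(samples_by_label[l]), str(l)),
    )
    for label in ordered:
        members = samples_by_label[label]
        want_test = round(test_fraction * len(members))
        have_test = sum(1 for i in members if assigned.get(i) == "test")
        free = [i for i in members if i not in assigned]
        rng.shuffle(free)
        for i in free:
            if have_test < want_test:
                assigned[i] = "test"
                have_test += 1
            else:
                assigned[i] = "train"

    # Samples without any label are assigned at random.
    for idx in range(len(label_sets)):
        if idx not in assigned:
            assigned[idx] = "test" if rng.random() < test_fraction else "train"

    train = sorted(i for i, side in assigned.items() if side == "train")
    test = sorted(i for i, side in assigned.items() if side == "test")
    return train, test


def missing_label_fraction(label_sets, subset):
    """Fraction of all labels that never occur among the samples in subset."""
    all_labels = set().union(*label_sets)
    seen = set()
    for i in subset:
        seen |= label_sets[i]
    return 1 - len(seen) / len(all_labels) if all_labels else 0.0
```

Note that under this heuristic a label with very few samples can round to zero test samples, which illustrates why stratifying long-tailed label sets is hard.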

Similar resources

Extreme Learning Machine for Multi-Label Classification

Xia Sun 1,*, Jingting Xu 1, Changmeng Jiang 1, Jun Feng 1, Su-Shing Chen 2 and Feijuan He 3. 1 School of Information Science and Technology, Northwest University, Xi’an 710069, China; 2 Computer Information Science and Engineering, University of Florida, Gainesville, FL 32608, USA; 3 Department o...


Deep Extreme Multi-label Learning

Extreme multi-label learning (XML), or classification, has been a practical and important problem since the boom of big data. The main challenge lies in the exponential label space, which involves $2^L$ possible label sets when the label dimension $L$ is very large, e.g., in the millions for Wikipedia labels. This paper is motivated to better explore the label space by building and modeling an explicit labe...


Adversarial Extreme Multi-label Classification

The goal in extreme multi-label classification is to learn a classifier that can assign a small subset of relevant labels to an instance from an extremely large set of target labels. Datasets in extreme classification exhibit a long tail of labels that have a small number of positive training instances. In this work, we pose the learning task in extreme classification with a large number of tail-...


Distribution Free Confidence Intervals for Quantiles Based on Extreme Order Statistics in a Multi-Sampling Plan

Extended Abstract. Let $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, k$, be independent random samples from the distributions $F^{\alpha_i}$, $i = 1, \ldots, k$, where $F$ is an absolutely continuous distribution function and $\alpha_i > 0$. Also, suppose that these samples are independent. Let $M_{i,n_i}$ and $M'_{i,n_i}$, respectively, denote the maximum and minimum of the ith sa...


Stratified and Un-stratified Sampling in Data Mining: Bagging

Stratified sampling is often used in opinion polls to reduce standard errors, and it is known as a variance reduction technique in sampling theory. The most common resampling approach is based on bootstrapping the dataset with replacement. A main purpose of this work is to investigate extensions of resampling methods in classification problems; specifically, we use decision trees, fr...



Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-75765-6_27